Basic overview
- 32 slides of colorectal tissue samples
- Number of images (“regions of interest”) per slide varies from 12 to 522
- We assume there are batch effects in the data, e.g. slide-to-slide variation that is not explained by biological signal
- For example, we are working with the marker Vimentin (VIM) – a stromal/immune cell marker
- Downstream, we aim to use VIM to build a clustering algorithm to identify different cell types in the tissues
Densities of VIM
- Started with ComBat as our method of choice
- Normalization in multimodal data is not optimal
- Basic approaches: simple and modal
- What is happening with registr? Is there a better way to tune those spline hyperparams?
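These notes do not define the “simple” and “modal” adjustments. As a rough illustration only, the sketch below shows one way a mode-based adjustment could work: estimate each slide's density mode and shift the slide so that mode lands on the pooled mode. The helper names (`density_mode`, `modal_shift`) and the toy data are hypothetical, and the adjustment actually run for the plots may differ.

```r
# Hypothetical helper: location of the highest peak of a kernel density estimate.
density_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

# Shift every slide so that its mode lands on the pooled (all-slide) mode.
# `value` and `slide` are illustrative argument names, not columns of cb_atl.
modal_shift <- function(value, slide) {
  pooled_mode <- density_mode(value)
  slide_modes <- tapply(value, slide, density_mode)
  offsets <- as.numeric(slide_modes[as.character(slide)])
  value - offsets + pooled_mode
}

# Toy data: two slides with the same shape but different locations.
set.seed(42)
value <- c(rnorm(1000, mean = 1.0, sd = 0.2), rnorm(1000, mean = 1.5, sd = 0.2))
slide <- rep(c("A", "B"), each = 1000)

adjusted <- modal_shift(value, slide)
adjusted_modes <- tapply(adjusted, slide, density_mode)
print(adjusted_modes)
```

A pure shift like this only moves each slide; it does not change spread or shape, so it cannot by itself fix slides whose peaks differ in width or number.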
library(ggplot2)
library(ggridges)
marker = "VIM"
ggplot(cb_atl) +
geom_density_ridges(aes(x=get(raw_vars[grepl(marker,raw_vars)]), y = SlideID,fill=SlideID)) +
ggtitle('Raw data') +
theme_minimal() +
theme(legend.position = "none")
## Picking joint bandwidth of 183

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(log_vars[grepl(marker,log_vars)]), y = SlideID,fill=SlideID)) +
ggtitle('Log data') +
theme_minimal() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.102

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(simple_vars[grepl(marker,simple_vars)]), y = SlideID,fill=SlideID)) +
ggtitle('Simple adjusted data') +
theme_minimal() +
theme(legend.position = "none") +
xlim(c(0,5))
## Picking joint bandwidth of 0.117
## Warning: Removed 143274 rows containing non-finite values (stat_density_ridges).

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(simple_vars_centered[grepl(marker,simple_vars_centered)]), y = SlideID,fill=SlideID)) +
ggtitle('Simple adjusted data (centered)') +
theme_minimal() +
theme(legend.position = "none") +
xlim(c(0,5))
## Picking joint bandwidth of 0.0733
## Warning: Removed 765513 rows containing non-finite values (stat_density_ridges).

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(combat_vars[grepl(marker,combat_vars)]), y = SlideID,fill=SlideID)) +
ggtitle('ComBat adjusted (slide only)') +
theme_minimal() +
theme(legend.position = "none") +
xlim(c(0,5))
## Picking joint bandwidth of 0.00231

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(combat_vars_og[grepl(marker,combat_vars_og)]), y = SlideID,fill=SlideID)) +
ggtitle('ComBat adjusted (original method)') +
theme_minimal() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.0357

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(combat_vars_p[grepl(marker,combat_vars_p)]), y = SlideID,fill=SlideID)) +
ggtitle('ComBat adjusted (slide only, with image level variance)') +
theme_minimal() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.0401

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(simple2_vars[grepl(marker,simple2_vars)]), y = SlideID,fill=SlideID)) +
ggtitle('Simple2 adjustment') +
theme_minimal() +
theme(legend.position = "none") +
xlim(c(0,10))
## Picking joint bandwidth of 0.178
## Warning: Removed 52214 rows containing non-finite values (stat_density_ridges).

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(modal_vars[grepl(marker,modal_vars)]), y = SlideID,fill=SlideID)) +
ggtitle('Simple modal adjustment') +
theme_minimal() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.102

ggplot(cb_atl) +
geom_density_ridges(aes(x=get(modal_vars_zeroes_removed[grepl(marker,modal_vars_zeroes_removed)]), y = SlideID,fill=SlideID)) +
ggtitle('Simple modal adjustment (zeroes removed)') +
theme_minimal() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.103
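
The ridge plots above repeat the same few lines with a different column and title each time. A minimal sketch of a helper that factors this out follows; `plot_marker_ridges` and its argument names are hypothetical, the `SlideID` column is taken from the code above, and the `.data[[column]]` pronoun is used here in place of `get()`.

```r
# Hypothetical helper: one ridge plot per adjustment, coloured by slide.
# `column` is the name of a single column in `data`.
plot_marker_ridges <- function(data, column, title, x_limits = NULL) {
  p <- ggplot2::ggplot(data) +
    ggridges::geom_density_ridges(
      ggplot2::aes(x = .data[[column]],
                   y = .data[["SlideID"]],
                   fill = .data[["SlideID"]])
    ) +
    ggplot2::ggtitle(title) +
    ggplot2::theme_minimal() +
    ggplot2::theme(legend.position = "none")
  # Optional x-axis limits; as in the plots above, xlim() drops values
  # outside the range before the densities are estimated.
  if (!is.null(x_limits)) {
    p <- p + ggplot2::xlim(x_limits)
  }
  p
}
```

Under these assumptions, a call such as `plot_marker_ridges(cb_atl, raw_vars[grepl(marker, raw_vars)], 'Raw data')` would stand in for the first plot. Because `.data[[column]]` needs exactly one column name, the call fails if the `grepl()` pattern matches more than one column.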

registr
## registr
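library(registr)
# Estimate one kernel density per slide and stack the curves in long format;
# register_fpca() expects the columns id, index and value
registr_data = data.frame()
for(s in unique(cb_atl$SlideID)){
dv = density(cb_atl[cb_atl$SlideID == s,
'Median_Cell_VIMENTIN_log10'])
one_slide = data.frame(dv$x,dv$y)
one_slide$slide = s
registr_data = rbind(registr_data,one_slide)
}
colnames(registr_data) = c("index","value","id")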
## Kt=7, Kh=3: 589
## Kt=8, Kh=5: 560
## Kt=9, Kh=4: 539
## Kt=10, Kh=5: 519
## Kt=11, Kh=4: 314
## Kt=12, Kh=5: 305
registr_bin1 = register_fpca(Y = registr_data,
family = "gaussian")
registr_bin2 = register_fpca(Y = registr_data,
family = "gaussian",
Kt = 9,
Kh = 4,
npc = 1)
registr_bin3 = register_fpca(Y = registr_data,
family = "gaussian",
Kt = 12,
Kh = 5,
npc = 1)
## remove the zeroes
## log10 data
ggplot(registr_data) +
geom_line(aes(x=index,y=value,color=id)) +
theme(legend.position = "none")

## Modal adjustment
ggplot(cb_atl) +
geom_density(aes(x=Median_Cell_VIMENTIN_log10_modal_adjusted_zeroes_removed,color=SlideID)) +
theme(legend.position = "none")

registr_adj_data1 = registr_bin1$fpca_obj$Yhat
ggplot(registr_adj_data1) +
geom_line(aes(x=index,y=value,color=id)) +
theme(legend.position = "none")

registr_adj_data2 = registr_bin2$fpca_obj$Yhat
ggplot(registr_adj_data2) +
geom_line(aes(x=index,y=value,color=id)) +
theme(legend.position = "none")

registr_adj_data3 = registr_bin3$fpca_obj$Yhat
ggplot(registr_adj_data3) +
geom_line(aes(x=index,y=value,color=id)) +
theme(legend.position = "none")

Confusion matrices
- Using the clusters within slide on the \(\log_{10}\) scale as the “reference”/ground truth.
library(caret)
# all_clus[-3] presumably skips cluster_within_slide_log (the reference itself),
# since that comparison is absent from the output below
for(clus in all_clus[-3]){
## truth: within slide (log10); prediction: each other clustering in turn
cmat1 = confusionMatrix(data = as.data.frame(cb_atl)[,clus],
reference = cb_atl$cluster_within_slide_log)
print("---------------------------------")
print("---------------------------------")
print("---------------------------------")
print(paste0(clus," compared to cluster_within_slide_log"))
print(cmat1)
}
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_raw compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1189901 250853
## 2 648599 1504119
##
## Accuracy : 0.7497
## 95% CI : (0.7493, 0.7501)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5017
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6472
## Specificity : 0.8571
## Pos Pred Value : 0.8259
## Neg Pred Value : 0.6987
## Prevalence : 0.5116
## Detection Rate : 0.3311
## Detection Prevalence : 0.4009
## Balanced Accuracy : 0.7521
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_raw compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1440691 1725791
## 2 397809 29181
##
## Accuracy : 0.409
## 95% CI : (0.4085, 0.4095)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.2032
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.78362
## Specificity : 0.01663
## Pos Pred Value : 0.45498
## Neg Pred Value : 0.06834
## Prevalence : 0.51162
## Detection Rate : 0.40092
## Detection Prevalence : 0.88118
## Balanced Accuracy : 0.40013
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_log compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1572782 148389
## 2 265718 1606583
##
## Accuracy : 0.8848
## 95% CI : (0.8844, 0.8851)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7697
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8555
## Specificity : 0.9154
## Pos Pred Value : 0.9138
## Neg Pred Value : 0.8581
## Prevalence : 0.5116
## Detection Rate : 0.4377
## Detection Prevalence : 0.4790
## Balanced Accuracy : 0.8855
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_simple compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1435282 104461
## 2 403218 1650511
##
## Accuracy : 0.8587
## 95% CI : (0.8584, 0.8591)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7184
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7807
## Specificity : 0.9405
## Pos Pred Value : 0.9322
## Neg Pred Value : 0.8037
## Prevalence : 0.5116
## Detection Rate : 0.3994
## Detection Prevalence : 0.4285
## Balanced Accuracy : 0.8606
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_simple compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1232268 113544
## 2 606232 1641428
##
## Accuracy : 0.7997
## 95% CI : (0.7993, 0.8001)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6017
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6703
## Specificity : 0.9353
## Pos Pred Value : 0.9156
## Neg Pred Value : 0.7303
## Prevalence : 0.5116
## Detection Rate : 0.3429
## Detection Prevalence : 0.3745
## Balanced Accuracy : 0.8028
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_cb compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1790548 15859
## 2 47952 1739113
##
## Accuracy : 0.9822
## 95% CI : (0.9821, 0.9824)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9645
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9739
## Specificity : 0.9910
## Pos Pred Value : 0.9912
## Neg Pred Value : 0.9732
## Prevalence : 0.5116
## Detection Rate : 0.4983
## Detection Prevalence : 0.5027
## Balanced Accuracy : 0.9824
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_cb compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1092613 1750934
## 2 745887 4038
##
## Accuracy : 0.3052
## 95% CI : (0.3047, 0.3057)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.4087
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.594296
## Specificity : 0.002301
## Pos Pred Value : 0.384243
## Neg Pred Value : 0.005385
## Prevalence : 0.511622
## Detection Rate : 0.304055
## Detection Prevalence : 0.791309
## Balanced Accuracy : 0.298298
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_simple_centered compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1597073 34729
## 2 241427 1720243
##
## Accuracy : 0.9232
## 95% CI : (0.9229, 0.9234)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8466
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8687
## Specificity : 0.9802
## Pos Pred Value : 0.9787
## Neg Pred Value : 0.8769
## Prevalence : 0.5116
## Detection Rate : 0.4444
## Detection Prevalence : 0.4541
## Balanced Accuracy : 0.9244
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_simple_centered compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1229879 113473
## 2 608621 1641499
##
## Accuracy : 0.7991
## 95% CI : (0.7986, 0.7995)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6005
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6690
## Specificity : 0.9353
## Pos Pred Value : 0.9155
## Neg Pred Value : 0.7295
## Prevalence : 0.5116
## Detection Rate : 0.3423
## Detection Prevalence : 0.3738
## Balanced Accuracy : 0.8021
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_og_combat compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1673617 168303
## 2 164883 1586669
##
## Accuracy : 0.9073
## 95% CI : (0.907, 0.9076)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8145
##
## Mcnemar's Test P-Value : 3.158e-09
##
## Sensitivity : 0.9103
## Specificity : 0.9041
## Pos Pred Value : 0.9086
## Neg Pred Value : 0.9059
## Prevalence : 0.5116
## Detection Rate : 0.4657
## Detection Prevalence : 0.5126
## Balanced Accuracy : 0.9072
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_og_combat compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1788326 1562802
## 2 50174 192170
##
## Accuracy : 0.5511
## 95% CI : (0.5506, 0.5517)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.0839
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9727
## Specificity : 0.1095
## Pos Pred Value : 0.5336
## Neg Pred Value : 0.7930
## Prevalence : 0.5116
## Detection Rate : 0.4977
## Detection Prevalence : 0.9326
## Balanced Accuracy : 0.5411
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_cb_slide_with_image_variance compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1509069 186306
## 2 329431 1568666
##
## Accuracy : 0.8565
## 95% CI : (0.8561, 0.8568)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7133
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8208
## Specificity : 0.8938
## Pos Pred Value : 0.8901
## Neg Pred Value : 0.8264
## Prevalence : 0.5116
## Detection Rate : 0.4199
## Detection Prevalence : 0.4718
## Balanced Accuracy : 0.8573
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_cb_slide_with_image_variance compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1694517 1741977
## 2 143983 12995
##
## Accuracy : 0.4752
## 95% CI : (0.4747, 0.4757)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0724
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.921685
## Specificity : 0.007405
## Pos Pred Value : 0.493095
## Neg Pred Value : 0.082782
## Prevalence : 0.511622
## Detection Rate : 0.471554
## Detection Prevalence : 0.956316
## Balanced Accuracy : 0.464545
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_simple2 compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1535152 112409
## 2 303348 1642563
##
## Accuracy : 0.8843
## 95% CI : (0.884, 0.8846)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.769
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8350
## Specificity : 0.9359
## Pos Pred Value : 0.9318
## Neg Pred Value : 0.8441
## Prevalence : 0.5116
## Detection Rate : 0.4272
## Detection Prevalence : 0.4585
## Balanced Accuracy : 0.8855
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_simple2 compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1459036 121517
## 2 379464 1633455
##
## Accuracy : 0.8606
## 95% CI : (0.8602, 0.8609)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7219
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7936
## Specificity : 0.9308
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.8115
## Prevalence : 0.5116
## Detection Rate : 0.4060
## Detection Prevalence : 0.4398
## Balanced Accuracy : 0.8622
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_simple_modal compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1628559 106002
## 2 209941 1648970
##
## Accuracy : 0.9121
## 95% CI : (0.9118, 0.9124)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8243
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8858
## Specificity : 0.9396
## Pos Pred Value : 0.9389
## Neg Pred Value : 0.8871
## Prevalence : 0.5116
## Detection Rate : 0.4532
## Detection Prevalence : 0.4827
## Balanced Accuracy : 0.9127
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_simple_modal compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1579717 154531
## 2 258783 1600441
##
## Accuracy : 0.885
## 95% CI : (0.8847, 0.8853)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7701
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8592
## Specificity : 0.9119
## Pos Pred Value : 0.9109
## Neg Pred Value : 0.8608
## Prevalence : 0.5116
## Detection Rate : 0.4396
## Detection Prevalence : 0.4826
## Balanced Accuracy : 0.8856
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_within_slide_modal_adjusted_zeroes_removed compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1835939 4278
## 2 2561 1750694
##
## Accuracy : 0.9981
## 95% CI : (0.9981, 0.9981)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9962
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9986
## Specificity : 0.9976
## Pos Pred Value : 0.9977
## Neg Pred Value : 0.9985
## Prevalence : 0.5116
## Detection Rate : 0.5109
## Detection Prevalence : 0.5121
## Balanced Accuracy : 0.9981
##
## 'Positive' Class : 1
##
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "---------------------------------"
## [1] "cluster_across_slide_modal_adjusted_zeroes_removed compared to cluster_within_slide_log"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1581075 158088
## 2 257425 1596884
##
## Accuracy : 0.8844
## 95% CI : (0.884, 0.8847)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7689
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8600
## Specificity : 0.9099
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.8612
## Prevalence : 0.5116
## Detection Rate : 0.4400
## Detection Prevalence : 0.4840
## Balanced Accuracy : 0.8850
##
## 'Positive' Class : 1
##
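
The printed matrices above are hard to compare side by side. The sketch below recomputes accuracy and Cohen's kappa from the raw counts in base R so that several clusterings can be stacked into one table. It also reports the same statistics with the two prediction labels swapped. Cluster numbering from an unsupervised method is often arbitrary, so a negative kappa could partly reflect swapped labels rather than real disagreement. Whether that applies here is an assumption, since these notes do not say how labels were matched across runs. The function names are hypothetical. The counts are copied from two of the matrices above.

```r
# Accuracy and Cohen's kappa from a 2x2 table of counts
# (rows = prediction, columns = reference, as in the printed output).
agreement_stats <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  c(accuracy = po, kappa = (po - pe) / (1 - pe))
}

# Same statistics after swapping the two prediction labels (rows).
agreement_stats_swapped <- function(tab) agreement_stats(tab[2:1, ])

# Counts copied from the printed matrices (filled column by column).
within_raw <- matrix(c(1189901, 648599, 250853, 1504119), nrow = 2,
                     dimnames = list(Prediction = 1:2, Reference = 1:2))
across_cb  <- matrix(c(1092613, 745887, 1750934, 4038), nrow = 2,
                     dimnames = list(Prediction = 1:2, Reference = 1:2))

summary_table <- rbind(
  within_raw        = agreement_stats(within_raw),
  across_cb         = agreement_stats(across_cb),
  across_cb_swapped = agreement_stats_swapped(across_cb)
)
print(round(summary_table, 4))
```

The first two rows should reproduce the accuracy and kappa values that `confusionMatrix()` printed for those clusterings. The swapped row is only a diagnostic and does not replace a proper label-matching step.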
What now?
- Pursuing an “alignment” approach that implements cost functions to combine the modal and simple adjustments. It effectively tries to optimally align slides based on “control points” (e.g. min, max, modes, local minima)
- Retooling the “slide only” ComBat – results don’t make sense
- Computer vision?
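
To make the control-point idea above concrete, here is a minimal sketch of landmark alignment using piecewise-linear warping. It is not the cost-function approach described in the notes. It assumes just three control points per slide (min, density mode, max) and maps them onto those of a reference slide with `approx()`. The helper names and the simulated slides are hypothetical.

```r
# Hypothetical helper: three control points of a sample (min, mode, max).
# The density is evaluated only between min and max, so the mode lies in range.
control_points <- function(x) {
  d <- density(x, from = min(x), to = max(x))
  c(min = min(x), mode = d$x[which.max(d$y)], max = max(x))
}

# Piecewise-linear warp sending each source control point to the matching
# reference control point; values in between are interpolated linearly.
warp_to_reference <- function(x, src, ref) {
  approx(x = src, y = ref, xout = x, rule = 2)$y
}

# Simulated slides standing in for real log10 intensities.
set.seed(1)
reference_slide <- rnorm(2000, mean = 2.0, sd = 0.3)
shifted_slide   <- rnorm(2000, mean = 2.6, sd = 0.4)

ref_cp  <- control_points(reference_slide)
src_cp  <- control_points(shifted_slide)
aligned <- warp_to_reference(shifted_slide, src_cp, ref_cp)

print(rbind(reference = ref_cp, before = src_cp, after = control_points(aligned)))
```

The warp is monotone, so the ordering of cells within a slide is preserved. More control points (e.g. local minima between modes) could be added in the same way, provided they can be matched one-to-one across slides.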